Abstract

We investigate the relation of two fundamen- 
tal tools in machine learning, that is the 
support vector machine (SVM) for classifica- 
tion, and the Lasso technique used in regres- 
sion. We show that the resulting optimiza- 
tion problems are equivalent, in the following 
sense: Given any instance of an $\ell_2$-loss soft-
margin (or hard-margin) SVM, we construct
a Lasso instance having the same optimal so- 
lutions, and vice versa. 

In consequence, many existing optimization 
algorithms for both SVMs and Lasso can 
also be applied to the respective other prob- 
lem instances. Also, the equivalence allows 
for many known theoretical insights for SVM 
and Lasso to be translated between the two 
settings. One such implication gives a simple
kernelized version of the Lasso, analogous to
the kernels used in the SVM setting. Another 
consequence is that the sparsity of a Lasso 
solution is equal to the number of support 
vectors for the corresponding SVM instance, 
and that one can use screening rules to prune 
the set of support vectors. Furthermore, we 
can relate sublinear time algorithms for the 
two problems, and give a new such algorithm 
variant for the Lasso. 



1. Introduction 

Large margin classification and kernel methods, and in
particular the support vector machine (SVM) (Cortes
& Vapnik, 1995), are among the most popular standard
tools for classification. On the other hand, $\ell_1$-regularized
least squares regression, i.e. the Lasso estimator
(Tibshirani, 1996), is one of the most widely
used tools for robust regression and sparse estimation.

Along with the many successful practical applications
of SVMs and the Lasso in various fields, there is a
vast amount of existing literature¹ on the two methods
themselves, considering both theory and algorithms
for each of the two. However, the two research
topics developed largely independently and have so far
rarely been put into context with each other.

In this paper, we attempt to better relate the two problems,
with two main goals in mind. First, we want to show
that on the algorithmic side, many of the existing algorithms
for each of the two problems can be set into
comparison, and can be applied to the respective other
problem. As a particular example of this idea, we
can apply the recent sublinear time SVM algorithm of
(Clarkson et al., 2010) also to any Lasso problem, resulting
in a new alternative sublinear time algorithm
variant for the Lasso.

As a second goal, we would like to relate and transfer
theoretical results between the existing literature
for SVMs and the Lasso. In this spirit, we propose a
simple kernelized variant of the Lasso, which is equivalent
to the well-researched use of kernels in the SVM
setting. Furthermore, we observe that by using our
equivalence, the sparsity of a Lasso solution is equal
to the number of support vectors for the corresponding
SVM instance. Finally, we point out that screening
rules, which are a way of pre-processing the input data
in order to identify inactive variables for the Lasso, can
also be applied to SVMs, to eliminate potential support
vectors and thereby reduce the problem size.

Support Vector Machines. In this work, we focus
on SVM large margin classifiers whose dual optimization
problem is of the form

$$\min_{x \in △} \|Ax\|_2^2 \, . \qquad (1)$$

This includes the commonly used soft-margin SVM
with $\ell_2$-loss (for one or two classes, with regularized
or no offset, with or without using a kernel). Here
the matrix $A \in \mathbb{R}^{d \times n}$ contains all $n$ datapoints as its
columns, and $△$ is the unit simplex in $\mathbb{R}^n$, being the
set of non-negative vectors summing up to one (i.e.
probability vectors). We will explain the large margin
interpretation of this optimization problem in more
detail in Section 2.

¹ As of October 2012, Google Scholar returned nearly
200,000 publications containing the term "Support Vector
Machine", and over 10,000 for Lasso regression.

Lasso. On the other hand, the Lasso (Tibshirani,
1996) is given by the quadratic program

$$\min_{x \in ♦} \|Ax - b\|_2^2 \, , \qquad (2)$$

also known as the constrained variant of $\ell_1$-regularized
least squares regression. Here the right hand side $b$ is
a fixed vector $b \in \mathbb{R}^d$, and $♦$ is the $\ell_1$-unit-ball in $\mathbb{R}^n$.
Note that if the desired $\ell_1$-regularization constraint is
not $\|x\|_1 \le 1$, but $\|x\|_1 \le r$ for some $r > 0$ instead,
then it is enough to simply re-scale the input matrix $A$
by a factor of $r$ (substituting $x = r x'$ with $\|x'\|_1 \le 1$),
in order to obtain our above formulation
(2) for any general Lasso problem.

In applications of the Lasso, it is important to distinguish
two alternative interpretations of the data matrix
$A$, which defines the problem instance (2): On
one hand, in the setting of sparse regression, the matrix
$A$ is usually called the dictionary matrix, with
its columns $A_{:j}$ being the dictionary elements, and the
goal being to approximate the single vector $b$ by a combination
of few dictionary vectors. On the other hand,
if the Lasso problem is interpreted as feature selection,
then each row $A_{i:}$ of the matrix $A$ is interpreted as
an input vector, and for each of those, the Lasso is
approximating the response $b_i$ to input $A_{i:}$. See e.g.
(Bühlmann & van de Geer, 2011) for a recent overview
of Lasso-type methods.

The Equivalence. We will prove that the two problems
(1) and (2) are indeed equivalent, in the following
sense: For any Lasso instance given by $(A, b)$, we construct
an equivalent SVM instance, having the same
optimal solution. This will be a simple reduction preserving
all objective values. On the other hand, the
task of finding an equivalent Lasso instance for a given
SVM appears to be a harder problem. Here we show
that there always exists such an equivalent Lasso instance,
and furthermore, if we are given a weakly-separating
vector for the SVM, then we can explicitly
construct the equivalent Lasso instance. This reduction
also applies to the $\ell_2$-loss soft-margin SVM, where
we show that a weakly-separating vector is trivial to
obtain. So our reduction does not require that the
SVM input data is separable.

On the way to this goal, we will also explain the relation
to the "non-negative" Lasso variant when the
variable vector $x$ is required to lie in the simplex, i.e.

$$\min_{x \in △} \|Ax - b\|_2^2 \, . \qquad (3)$$

It turns out the equivalence of the optimization problems
(1) and (3) is straightforward to see. Our main
contribution is to explain the relation of these two optimization
problems to the original Lasso problem (2),
and to study some of the implications of the equivalence.

Related Work. The early work of (Girosi, 1998) has
already significantly deepened the joint understanding
of kernel methods and the sparse coding setting of
the Lasso. Despite its title, (Girosi, 1998) is not addressing
SVM classifiers, but in fact the $\varepsilon$-insensitive
loss variant of support vector regression (SVR), which
the author proves to be equivalent to a Lasso problem
where $\varepsilon$ then becomes the $\ell_1$-regularization. Unfortunately,
this reduction does not apply anymore when
$\varepsilon = 0$, which is the case of interest for standard hinge-loss
SVR, and also for SVMs in the classification setting,
which are the focus of our work here.

In a different line of research, (Li et al., 2005) have 
studied the relation of a dual variant of the Lasso to 
the primal of the so called potential SVM originally 
proposed by (Hochreiter & Obermayer, 2004), which
is not a classifier but a specialized method of feature 
selection. 

In the application paper (Ghosh & Chinnaiyan, 2005) 
in the area of computational biology, the authors al- 
ready suggested to make use of the "easier" direction 
of our reduction, reducing the Lasso to a very particu- 
lar SVM instance. Here, the idea is to use barycentric 
coordinates to represent points in the $\ell_1$-ball. Alternatively,
this can also be interpreted as considering an
SVM defined by all Lasso dictionary vectors together 
with their negatives (2n many points). We formal- 
ize this interpretation more precisely in Section 3.1. 
(Ghosh & Chinnaiyan, 2005) does not address the 
SVM regularization parameter. 

Notation. The unit simplex, the filled simplex as
well as the $\ell_1$-unit-ball in $\mathbb{R}^n$ are central for our investigations,
and will be denoted by

$$△ := \{x \in \mathbb{R}^n \mid x \ge 0, \ \mathbf{1}^T x = 1\} \, ,$$
$$▲ := \{x \in \mathbb{R}^n \mid x \ge 0, \ \mathbf{1}^T x \le 1\} \, ,$$
$$♦ := \{x \in \mathbb{R}^n \mid \|x\|_1 \le 1\} \, .$$

For a given matrix $A \in \mathbb{R}^{d \times n}$, we write $A_i \in \mathbb{R}^d$,
$i \in [1..n]$ for its columns. By $\mathbf{0}$ and $\mathbf{1}$ we denote the
all-zero and all-ones vectors in $\mathbb{R}^n$, and $I_n$ is the $n \times n$
identity matrix. We write $(A | B)$ for the horizontal
concatenation of two matrices $A, B$.

2. Properties of the SVM Problem

Linear classification using large margin is usually introduced
in the setting where we are given $n$ datapoints
$X_i \in \mathbb{R}^d$, together with their binary labels $y_i \in \{\pm 1\}$,
for $i \in [1..n]$. The linear classifier separating the two
point classes by the best possible margin is found by
the following SVM optimization problem

$$\max_{w \in \mathbb{R}^d} \ \min_i \ y_i X_i^T \frac{w}{\|w\|_2} \, ,$$

i.e. maximizing the minimum distance of the points to
the hyperplane given by the normal vector $w$, passing
through the origin. In our case, the dual of the SVM problem
(1) is exactly such a large margin formulation, or
more formally

$$\max_{x \in △} \ \min_i \ A_i^T \frac{Ax}{\|Ax\|_2} \, . \qquad (4)$$



This can be proved in two ways, when starting from
the non-squared version of the SVM problem (1), that
is $\min_{x \in △} \|Ax\|_2$. Computing the simple "linearization"
dual as in (Jaggi, 2013) (which can be seen as a
special case of Fenchel duality) directly gives (4). Alternatively,
taking the standard Lagrange dual (Boyd
& Vandenberghe, 2004, Section 5) and plugging in the
KKT conditions also leads to the same expression, see
e.g. (Gärtner & Jaggi, 2009, Appendix A).

Because $x \in △$, any feasible point $x$ readily gives us
a candidate classifier vector w = Ax, represented as a 
convex combination of the datapoints. The datapoints 
corresponding to non-zero entries in x are called the 
support vectors. 

A crucial and widely used observation is that both optimization
problems (1) and (4) are formulated purely
in terms of the inner products of the datapoints $A_i := y_i X_i$,
meaning that they can directly be optimized in
the kernel case (Cortes & Vapnik, 1995), where we only
have access to the entries of the matrix $A^T A \in \mathbb{R}^{n \times n}$,
but not the explicit features $A \in \mathbb{R}^{d \times n}$.

It is natural to measure the quality of an approximate 
solution x to the SVM problem as the attained margin, 
which is precisely the attained value in the above dual 
problem (4). 

Definition 1. A vector $w \in \mathbb{R}^d$ is called $\sigma$-weakly-separating
for the SVM instance (1) or (4) respectively,
for a parameter $\sigma > 0$, if it holds that

$$A_i^T \frac{w}{\|w\|_2} \ge \sigma \quad \forall i \, ,$$

meaning that $w$ attains a margin of separation of $\sigma$.

Any such attained margin value in (4) for some $Ax = w$
directly gives a certificate on the duality gap as the
difference from the corresponding value $\|Ax\|_2$ in problem
(1), making this quantity a useful stopping criterion
for SVM optimizers, see e.g. (Gärtner & Jaggi,
2009; Clarkson et al., 2010). The simple perceptron
algorithm (Rosenblatt, 1958) is known to return a $\sigma$-weakly-separating
solution to the SVM after $O(1/\varepsilon^2)$
iterations, for $\varepsilon := \sigma^* - \sigma$ being the additive error,
where $\sigma^*$ is the optimal value of (4).
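The stopping criterion described above can be illustrated with a minimal NumPy sketch. The function names and the random toy data are hypothetical and not part of the original work; the sketch only evaluates the attained margin of $w = Ax$ and its difference to $\|Ax\|_2$ for one feasible point.

```python
import numpy as np

def attained_margin(A, w):
    """Smallest value of A_i^T w / ||w||_2 over all columns A_i, as in (4)."""
    return float(np.min(A.T @ w) / np.linalg.norm(w))

def duality_gap(A, x):
    """Difference between ||Ax||_2 and the margin attained by w = Ax."""
    w = A @ x
    return float(np.linalg.norm(w) - attained_margin(A, w))

rng = np.random.default_rng(0)
# Hypothetical toy data: 20 points in R^5, shifted away from the origin.
A = rng.normal(size=(5, 20)) + 3.0
x = rng.random(20)
x /= x.sum()                      # a feasible point of the unit simplex
gap = duality_gap(A, x)           # non-negative by weak duality
```

The gap is non-negative for every simplex point, since the smallest of the values $A_i^T w$ can never exceed their convex combination $(Ax)^T w$.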

2.1. Soft-Margin SVMs 

For the successful practical application of SVMs, the
soft-margin concept of tolerating outliers is of central
importance. Here we recall that also the soft-margin
SVM variants using $\ell_2$-loss, with regularized
offset or no offset, both in the one-class and the two-class
case, can be formulated in the form (1). This fact
is known in the SVM literature (Schölkopf & Smola,
2002; Keerthi et al., 2000; Tsang et al., 2005), and can
be formalized as follows:

The two-class soft-margin SVM with squared loss is
given by the optimization problem

$$\min_{\bar w \in \mathbb{R}^d,\ \rho \in \mathbb{R},\ \xi \in \mathbb{R}^n} \ \tfrac{1}{2} \|\bar w\|_2^2 - \rho + \tfrac{C}{2} \sum_i \xi_i^2
\quad \text{s.t.} \quad y_i \cdot \bar w^T X_i \ge \rho - \xi_i \ \ \forall i \in [1..n] \, . \qquad (5)$$

Here $C > 0$ is the regularization parameter, and
$\rho / \|\bar w\|_2$ is the attained margin of separation. Note that
in the classical SVM formulation, the margin parameter
$\rho$ is usually fixed to one instead, while $\rho$ is explicitly
used in the equivalent $\nu$-SVM formulation known
in the literature, see e.g. (Schölkopf & Smola, 2002).
The equivalence of the soft-margin SVM dual problem
to the optimization problem (1) is stated in the
following Lemma:

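Lemma 2. The dual of the soft-margin SVM (5) is
an instance of the classifier formulation (1), that is
$\min_{x \in △} \|Ax\|_2^2$, with

$$A := \begin{pmatrix} Z \\ \frac{1}{\sqrt{C}} I_n \end{pmatrix} \in \mathbb{R}^{(d+n) \times n} \, ,$$

where the data matrix $Z \in \mathbb{R}^{d \times n}$ consists of the $n$
columns $Z_i := y_i X_i$.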

Proof. Given in Appendix B for completeness, using 
standard Lagrange duality. □ 

Obtaining a Weakly-Separating Vector for the
$\ell_2$-loss Soft-Margin SVM. By the above lemma,
we observe that a weakly-separating vector is trivial
to obtain for the $\ell_2$-loss SVM. This holds without any
assumptions on the original input data $(X_i, y_i)$. We set
$w := \binom{\mathbf{0}}{\frac{1}{\sqrt{n}} \mathbf{1}} \in \mathbb{R}^{d+n}$ to the all-one vector only on the
second block of coordinates, rescaled to unit length.
Clearly, this direction $w$ attains a separation margin
of $A_i^T \frac{w}{\|w\|_2} = \frac{1}{\sqrt{nC}} > 0$ for all points $i$
in Definition 1.
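As a quick numerical illustration of Lemma 2 and of the trivial weakly-separating vector, the following sketch builds the stacked matrix $A$ and checks the margin $1/\sqrt{nC}$. The helper names and the random, generally non-separable data are assumptions made for this example only.

```python
import numpy as np

def soft_margin_matrix(X, y, C):
    """Stack Z (columns y_i * X_i) on top of I_n / sqrt(C), as in Lemma 2."""
    n = X.shape[1]
    Z = X * y                     # broadcasting multiplies column i by y_i
    return np.vstack([Z, np.eye(n) / np.sqrt(C)])

def trivial_separator(d, n):
    """Unit vector: zero on the first d coordinates, constant on the last n."""
    return np.concatenate([np.zeros(d), np.ones(n) / np.sqrt(n)])

rng = np.random.default_rng(1)
d, n, C = 4, 30, 10.0
X = rng.normal(size=(d, n))               # hypothetical input points
y = rng.choice([-1.0, 1.0], size=n)       # hypothetical random labels
A = soft_margin_matrix(X, y, C)
w = trivial_separator(d, n)
margins = A.T @ w / np.linalg.norm(w)     # one margin value per datapoint
```

All entries of `margins` coincide, because $w$ only interacts with the identity block of $A$, independently of the data.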



Incorporating an Offset Term. Our above SVM
formulation also allows the use of an offset variable
$b \in \mathbb{R}$ to obtain a classifier that does not necessarily
pass through the origin. Formally, the separation constraints
then become $y_i \cdot (\bar w^T X_i + b) \ge \rho - \xi_i \ \forall i \in [1..n]$.
This formulation can easily be seen to be equivalent
to (5), by the standard trick of increasing the dimensionality
of $X_i$ and $\bar w$ by one, and adding a fixed
value of one as the last coordinate to each of the datapoints
$X_i$, see e.g. (Keerthi et al., 2000; Tsang et al.,
2005). As a side-effect, the offset $b^2$ is then also regularized
in the new term $\|\bar w\|_2^2$. Nevertheless, if desired,
the effect of this additional regularization can be made
arbitrarily weak by re-scaling the fixed additional feature
value from one to a larger value.

One-class SVMs. All mentioned properties in this 
section also hold for the case of one-class SVMs, by 
setting all labels to one, resulting in the same form of 
optimization problems (1) and (4). 

3. The Equivalence 

Before we investigate the "real" Lasso problem (2) in
the next two subsections, we will warm up by considering
the non-negative variant (3). It is a simple
observation that the non-negative Lasso (3) is directly
equivalent to the SVM problem (1) by a translation:

Translations, and the Equivalence of (1)
and (3). By translating each column vector of the
matrix $A$ by the vector $-b$, any instance of (3) becomes
precisely an SVM instance (1), with the data matrix
being $\tilde A := A - b\mathbf{1}^T \in \mathbb{R}^{d \times n}$. Here we have crucially
used the simplex domain, ensuring that $b\mathbf{1}^T x = b$ for
any $x \in △$. Note that the translation precisely preserves
the objective values of all $x$. The reduction in
the other direction is trivial by choosing $b := 0$.
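The translation argument can be checked numerically. The following is a minimal sketch on hypothetical random data; it compares the non-negative Lasso objective (3) with the SVM objective (1) of the translated matrix for one point of the simplex.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 6, 8
A = rng.normal(size=(d, n))               # hypothetical data matrix
b = rng.normal(size=d)                    # hypothetical right hand side
A_tilde = A - np.outer(b, np.ones(n))     # subtract b from every column

x = rng.random(n)
x /= x.sum()                              # a point of the unit simplex

lasso_value = float(np.sum((A @ x - b) ** 2))       # objective of (3)
svm_value = float(np.sum((A_tilde @ x) ** 2))       # objective of (1)
```

The two values agree only because the entries of `x` sum to one; for a vector outside the simplex the term $b\mathbf{1}^T x$ would no longer equal $b$.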

3.1. (Lasso ≼ SVM): Given a Lasso instance,
construct an equivalent SVM instance

(This reduction is significantly easier than the other 
direction.) 

Parameterizing the $\ell_1$-Ball as a Convex Hull.
In order to represent the $\ell_1$-ball $♦$ by a simplex $△$, the
standard concept of barycentric coordinates comes to
help, stating that every polytope can be represented
as the convex hull of its vertices (Ziegler, 1995). The
$\ell_1$-ball $♦$ is the convex hull of its $2n$ vertices, which
are $\{\pm e_i \mid i \in [1..n]\}$, illustrating why $♦$ is also called
the cross-polytope.

The barycentric representation of the $\ell_1$-ball therefore
amounts to using two non-negative variables to
"represent" a real variable, which can be formalized
as follows: Any $n$-vector $x_♦ \in ♦$ can be written as
$x_♦ = (I_n | -I_n) x_△$ for some $x_△ \in △ \subset \mathbb{R}^{2n}$. Note that the
barycentric representation is usually not a bijection, as
there might be several $x_△ \in △$ representing the same
point $x_♦$.


The Equivalent SVM Instance. Given a Lasso
instance of the form (2), that is, $\min_{x \in ♦} \|Ax - b\|_2^2$,
we can directly parameterize the $\ell_1$-ball by the $2n$-dimensional
simplex as described above. By writing
$(I_n | -I_n) x_△$ for any $x \in ♦$, the objective function
becomes $\|(A | -A) x_△ - b\|_2^2$. This means we have obtained
the equivalent non-negative regression problem
of the form (3) over the domain $x_△ \in △$ which, by
our above remark on translations, is equivalent to the
SVM formulation (1), i.e.

$$\min_{x_△ \in △} \|\tilde A x_△\|_2^2 \, ,$$

where the data matrix is given by $\tilde A := (A | -A) - b\mathbf{1}^T \in \mathbb{R}^{d \times 2n}$.
The additive rank-one term $b\mathbf{1}^T$ for $\mathbf{1} \in \mathbb{R}^{2n}$
again just means that the vector $b$ is subtracted
from each original column of $A$ and $-A$. So we have
obtained an equivalent SVM instance consisting of $2n$
points in $\mathbb{R}^d$.

Note that this equivalence not only means that the
optimal solutions of the Lasso and the SVM coincide,
but indeed gives us a correspondence of
all feasible solutions, preserving the objective values:
For any feasible solution $x \in \mathbb{R}^n$ to the Lasso, we have
a feasible SVM solution $x_△ \in \mathbb{R}^{2n}$ of the same objective
value, and vice versa.
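The reduction of this subsection can be sketched in a few lines of NumPy. The function names are hypothetical, and the way leftover mass is distributed in `to_simplex` is just one of the many valid barycentric representations, chosen here for simplicity.

```python
import numpy as np

def lasso_to_svm(A, b):
    """Return the d x 2n SVM data matrix (A | -A) - b 1^T."""
    return np.hstack([A, -A]) - b[:, None]

def to_simplex(x):
    """Map x with ||x||_1 <= 1 to barycentric coordinates in the 2n-simplex."""
    pos, neg = np.maximum(x, 0.0), np.maximum(-x, 0.0)
    slack = 1.0 - np.abs(x).sum()
    # Leftover mass is split evenly over the vertex pair +e_1, -e_1, which cancels.
    pos[0] += slack / 2.0
    neg[0] += slack / 2.0
    return np.concatenate([pos, neg])

def from_simplex(x_simplex):
    """Apply (I_n | -I_n) to recover the point of the l1-ball."""
    n = x_simplex.size // 2
    return x_simplex[:n] - x_simplex[n:]

rng = np.random.default_rng(4)
d, n = 5, 7
A = rng.normal(size=(d, n))               # hypothetical dictionary matrix
b = rng.normal(size=d)                    # hypothetical right hand side
x = rng.normal(size=n)
x *= 0.7 / np.abs(x).sum()                # a point strictly inside the l1-ball
x_s = to_simplex(x)
lasso_value = float(np.sum((A @ x - b) ** 2))
svm_value = float(np.sum((lasso_to_svm(A, b) @ x_s) ** 2))
```

Both objective values coincide, and `from_simplex(to_simplex(x))` returns `x`, while several different simplex points map to the same `x`.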



3.2. (SVM ≼ Lasso): Given an SVM instance,
construct an equivalent Lasso instance

This reduction is harder to accomplish than the other
direction we explained before. Given an instance of an
SVM problem (1), we suppose that we have a (possibly
non-optimal) $\sigma$-weakly-separating vector $w \in \mathbb{R}^d$
available, for some (small) value $\sigma > 0$. Given $w$, we
will demonstrate in the following how to construct an
equivalent Lasso instance (2).

Perhaps surprisingly, such a weakly-separating vector
$w$ is trivial to obtain for the $\ell_2$-loss soft-margin SVM,
as we have observed in Section 2.1 (even if the SVM
input data is not separable). Also for other SVM
variants, finding such a weakly-separating vector for
a small $\sigma$ is still significantly easier than the final goal
of obtaining a near-perfect $(\sigma^* - \varepsilon)$-separation for a
small precision $\varepsilon$. It corresponds to running an SVM
solver (such as the perceptron algorithm) for only constantly
many iterations. In contrast, obtaining a better
$\varepsilon$-accurate solution by the same algorithm would
require $O(1/\varepsilon^2)$ iterations, as mentioned in Section 2.



The Equivalent Lasso Instance. Formally, we define
the Lasso instance $(\tilde A, \tilde b)$ as the translated SVM
datapoints $\tilde A := \{A_i + \tilde b \mid i \in [1..n]\}$, together with
the right hand side

$$\tilde b := -\frac{w}{\|w\|_2} \cdot \frac{D^2}{\sigma} \, .$$

Here $D > 0$ is a strict upper bound on the length
of the original SVM datapoints, i.e. $\|A_i\|_2 < D \ \forall i$.
By definition of $\tilde A$, the resulting new Lasso objective
function is

$$\|\tilde A x - \tilde b\|_2 = \|(A + \tilde b \mathbf{1}^T) x - \tilde b\|_2 = \|Ax + (\mathbf{1}^T x - 1) \tilde b\|_2 \, . \qquad (6)$$

Therefore, this objective coincides with the original
SVM objective (1), for any $x \in △$ (meaning that
$\mathbf{1}^T x = 1$). However, this does not necessarily hold for
the larger part of the Lasso domain when $x \in ♦ \setminus △$.
In the following discussion and the main Theorem 3,
we will prove that all those candidates $x \in ♦ \setminus △$ can
be discarded from the Lasso problem, as they do not
contribute to any optimal solutions.

As a side-remark, we note that the quantity $D/\sigma$ that determines
the magnitude of our translation is a known
parameter measuring the "difficulty" of the SVM instance,
essentially its VC-dimension (Burges, 1998;
Schölkopf & Smola, 2002).

Geometric Intuition. Geometrically, the Lasso
problem (2) is to compute the smallest Euclidean distance
of the set $A♦$ to the point $b \in \mathbb{R}^d$. On the other
hand the SVM problem (after translating by $b$) is
to minimize the distance of the smaller set $A△ \subset A♦$
to the point $b$. Here we have used the notation $AS :=
\{Ax \mid x \in S\}$ for subsets $S \subseteq \mathbb{R}^n$ and linear maps $A$ (it
is easy to check that linear maps do preserve convexity
of sets, so that $\mathrm{conv}(AS) = A\,\mathrm{conv}(S)$).

Intuitively, the main idea of our reduction is to mirror
our SVM points $A_i$ at the origin, such that both
the points and their mirrored copies (and therefore
the entire larger polytope $A♦$) end up lying "behind"
the separating SVM margin. The hope is that
the resulting Lasso instance will have all its optimal
solutions being non-negative, and lying in the simplex.
Surprisingly, this can be done, and we will show
that all SVM solutions are preserved (and no new solutions
are introduced) when the feasible set $△$ is extended
to $♦$. In the following we will formalize this
precisely, and demonstrate how to translate along our
known weakly-separating vector $w$ such that the resulting
Lasso problem will have the same solution as
the original SVM.


Properties of the Constructed Lasso Instance.
The following theorem shows that for our constructed
Lasso instance, all interesting feasible solutions are
contained in the simplex $△$. By our previous observation
(6), we already know that all those candidates
are feasible for both the Lasso (2) and the SVM (1),
and obtain the same objective values in both problems.

In other words, we have a one-to-one correspondence
between all feasible points for the SVM (1) on one
hand, and the subset $△ \subset ♦$ of feasible points of our
constructed Lasso instance (2), preserving all objective
values. Furthermore, we have that in this Lasso
instance, all points in $♦ \setminus △$ are strictly worse than
the ones in $△$. Therefore, we have also shown that all
optimal solutions must coincide.

Theorem 3. For any candidate solution $x_♦ \in ♦$ to the
Lasso problem (2) defined by $(\tilde A, \tilde b)$, there is a feasible
vector $x_△ \in △$ in the simplex, of the same or better
Lasso objective value $\gamma$.

Furthermore, this $x_△ \in △$ attains the same objective
value $\gamma$ in the original SVM problem (1).

The proofs of the following two main propositions are
given in Appendix A, and make use of the translation
$\tilde b$ defined along a weakly-separating vector $w$.

Proposition 4 (Flipping improves the objective).
Consider the Lasso problem (2) defined by $(\tilde A, \tilde b)$, and
assume that $x_♦ \in ♦$ has some negative entries.

Then there is a strictly better solution $x_▲ \in ▲$ having
only non-negative entries.

Proposition 5 (Scaling up improves for non-negative
vectors). Consider the Lasso problem (2) defined
by $(\tilde A, \tilde b)$, and assume that $x_▲ \in ▲$ has $\|x_▲\|_1 < 1$.

Then we obtain a strictly better solution $x_△ \in △$ by
linearly scaling $x_▲$.

Proof of Theorem 3. By Propositions 4 and 5, and assuming
that $x_♦$ does not already lie in the simplex, we
obtain $x_△ \in △$ of a strictly better objective value $\gamma$ for
problem (3). By the observation (6) about the Lasso
objective, we know that the original SVM objective
attained by this $x_△$ is equal to $\gamma$. □
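The construction and the statement of Theorem 3 can be explored numerically. The sketch below is illustrative only: the function names, the factor 1.01 used to obtain a strict bound $D$, and the random $\ell_2$-loss soft-margin instance are assumptions of this example. It builds the translated Lasso instance, checks observation (6) on the simplex, and compares random points of the $\ell_1$-ball with their flipped and scaled-up counterparts.

```python
import numpy as np

def svm_to_lasso(A, w, sigma):
    """Translate the SVM columns A_i by b_t = -(w / ||w||_2) * D^2 / sigma."""
    D = 1.01 * np.linalg.norm(A, axis=0).max()   # strict upper bound on ||A_i||_2
    b_t = -(w / np.linalg.norm(w)) * D**2 / sigma
    return A + b_t[:, None], b_t

def lasso_objective(A_t, b_t, x):
    return float(np.sum((A_t @ x - b_t) ** 2))

def flip_and_scale(x):
    """Flip negative entries (Proposition 4), then scale to l1-norm one (Proposition 5)."""
    x_abs = np.abs(x)
    return x_abs / x_abs.sum()

rng = np.random.default_rng(3)
d, n, C = 3, 15, 5.0
# Hypothetical l2-loss soft-margin instance (Lemma 2) with its trivial separator.
Z = rng.normal(size=(d, n)) * rng.choice([-1.0, 1.0], size=n)
A = np.vstack([Z, np.eye(n) / np.sqrt(C)])
w = np.concatenate([np.zeros(d), np.ones(n) / np.sqrt(n)])
sigma = 1.0 / np.sqrt(n * C)
A_t, b_t = svm_to_lasso(A, w, sigma)

# On the simplex, the Lasso objective equals the SVM objective ||Ax||_2^2, cf. (6).
x_simplex = rng.random(n)
x_simplex /= x_simplex.sum()
simplex_gap = abs(lasso_objective(A_t, b_t, x_simplex) - np.sum((A @ x_simplex) ** 2))

# Points of the l1-ball outside the simplex are compared with their improved versions.
improvements = []
for _ in range(100):
    x = rng.normal(size=n)
    x *= rng.uniform(0.1, 1.0) / np.abs(x).sum()
    improvements.append(lasso_objective(A_t, b_t, x)
                        - lasso_objective(A_t, b_t, flip_and_scale(x)))
```

Every entry of `improvements` is positive in this experiment, in line with Propositions 4 and 5; this is of course only a sanity check on sampled points and not a substitute for the proofs in Appendix A.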
4. Implications & Remarks

4.1. Some Implications for the Lasso 

Using only the "easy" direction (Lasso ≼ SVM) of our
reduction in Section 3.1, we obtain the following in- 
sights into the Lasso problem: 

Sublinear Time Algorithms. The recent breakthrough
SVM algorithm of (Clarkson et al., 2010;
Hazan et al., 2011) returns, in time $O(\varepsilon^{-2}(n + d) \log n)$,
an $\varepsilon$-accurate solution to problem (1). Here $\varepsilon$-accurate
means $(\sigma^* - \varepsilon)$-weakly-separating. The running time
of the algorithm is remarkable since it is significantly
smaller than even the size of the input matrix, being
$d \cdot n$. Therefore the algorithm does not read the
full input matrix $A$. More precisely, (Clarkson et al.,
2010, Corollary III.2) proves that the algorithm provides
(w.h.p.) a solution $p^* \in △$ of additive error at
most $\varepsilon$ to

$$\min_{p \in △} \ \max_{w \in \mathbb{R}^d,\ \|w\|_2 \le 1} w^T A p \, .$$

This is a reformulation of $\min_{p \in △} p^T A^T A p$, which is exactly
our SVM problem (1), since for given $p$, the inner
maximum is attained when $w$ points in the direction of $Ap$.
Therefore, using
our simple trick from Section 3.1 of reducing any Lasso
instance (2) to an SVM (1) (with its matrix $\tilde A$ having
twice the number of columns as $A$), we directly obtain
a sublinear time algorithm for the Lasso. Note that
since the algorithm of (Clarkson et al., 2010; Hazan
et al., 2011) only accesses the matrix $\tilde A$ by simple
entry-wise queries, it is not necessary to explicitly compute
and store $\tilde A$ (which is a preprocessing that would
need linear time and storage). Instead, every entry $\tilde A_{ij}$
that is queried by the algorithm can be provided on
the fly, by returning the corresponding (signed) entry
of the Lasso matrix $A$, minus $b_i$.
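A possible way to serve such entry-wise queries, sketched under the assumption that the SVM matrix is $(A | -A) - b\mathbf{1}^T$ as in Section 3.1, is shown below. The function name `svm_entry` is hypothetical and not taken from the cited algorithms.

```python
import numpy as np

def svm_entry(A, b, i, j):
    """Entry (i, j) of (A | -A) - b 1^T, computed without forming the matrix."""
    n = A.shape[1]
    sign = 1.0 if j < n else -1.0         # the second block holds the mirrored columns
    return sign * A[i, j % n] - b[i]

rng = np.random.default_rng(5)
d, n = 4, 6
A = rng.normal(size=(d, n))               # hypothetical Lasso matrix
b = rng.normal(size=d)                    # hypothetical right hand side
explicit = np.hstack([A, -A]) - b[:, None]
on_the_fly = np.array([[svm_entry(A, b, i, j) for j in range(2 * n)]
                       for i in range(d)])
```

Each query costs constant time and no additional storage, so the explicit $d \times 2n$ matrix (built here only for comparison) is never needed by the solver.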

It will be interesting to compare this alternative al- 
gorithm to the recent more specialized sublinear time 
Lasso solvers in the line of work of (Cesa-Bianchi et al., 
2011; Hazan & Koren, 2012), which are only allowed to 
access a constant fraction of the entries (or features) 
of each row of A. If we use our proposed reduction 
here instead, the resulting algorithm from (Clarkson 
et al., 2010) has more freedom: it can (randomly) pick 
arbitrary entries of A, without necessarily accessing an 
equal number of entries from each row. 

A Lasso in Kernel Space. Traditional kernel regression
techniques (Smola & Schölkopf, 2004; Saunders
et al., 1998; Roth, 2004) try to learn a real-valued
function $f$ from the space $\mathbb{R}^d$ of the datapoints, such
that the resulting real value for each datapoint approximates
some observed value. The function $f$ is chosen
to be a linear combination of the (kernel) inner products
with a few existing datapoints in the kernel space.

Here, as we present a kernelization of the Lasso that
is in complete analogy to the classical kernel trick for
SVMs, our goal is different. We are not trying to approximate
many individual real values (one for each
datapoint, or row of $A$), but instead we are searching
for a linear combination of our points in kernel
space, such that the resulting combination is close to
the lifted point $b$, measured in the kernel space norm.
Formally, suppose our kernel space $\mathcal{H}$ is given by an
inner product $\kappa(y, z) = \langle \Psi(y), \Psi(z) \rangle$ for some implicit
mapping $\Psi : \mathbb{R}^d \to \mathcal{H}$. Then we define our kernelized
variant of the Lasso as

$$\min_{x \in ♦} \Big\| \sum_i \Psi(A_i) x_i - \Psi(b) \Big\|_{\mathcal{H}}^2 \, . \qquad (7)$$

Nicely, analogous to the SVM case, also this objective
function here is determined purely in terms of the
pairwise (kernel) inner products $\kappa(\cdot, \cdot)$.

An alternative way to see this is to observe that our
simple "mirror-and-translate" trick from Section 3.1
also works the very same way in any kernel space $\mathcal{H}$.
Here, the equivalent SVM instance is given by the $2n$
new points $\{\pm\Psi(A_i) - \Psi(b) \mid i \in [1..n]\} \subset \mathcal{H}$. The
crucial observation is that the (kernel) inner product
of any two such points is

$$\langle s_i \Psi(A_i) - \Psi(b), \ s_j \Psi(A_j) - \Psi(b) \rangle
= s_i s_j \kappa(A_i, A_j) - s_i \kappa(A_i, b) - s_j \kappa(A_j, b) + \kappa(b, b) \, .$$

Here $s_i, s_j \in \{\pm 1\}$ are the signs corresponding to each
point. Therefore we have completely determined the
resulting $2n \times 2n$ kernel matrix $K$ that defines the kernelized
SVM (1), namely $\min_{x \in △} x^T K x$, which solves
our equivalent Lasso problem (7) in the kernel space $\mathcal{H}$.
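The kernel matrix $K$ can be assembled directly from the formula above. The following sketch is illustrative: the function names and the choice of a polynomial kernel are assumptions, and the linear kernel is included only because it allows a comparison with the explicit matrix of Section 3.1.

```python
import numpy as np

def poly_kernel(u, v, degree=2):
    """Hypothetical kernel choice: inhomogeneous polynomial kernel."""
    return (1.0 + u @ v) ** degree

def linear_kernel(u, v):
    return u @ v

def kernel_lasso_svm_matrix(A, b, kernel):
    """2n x 2n kernel matrix of the points s_i * Psi(A_i) - Psi(b)."""
    n = A.shape[1]
    G = np.array([[kernel(A[:, i], A[:, j]) for j in range(n)] for i in range(n)])
    g = np.array([kernel(A[:, i], b) for i in range(n)])
    s = np.concatenate([np.ones(n), -np.ones(n)])   # signs of the 2n points
    G2 = np.tile(G, (2, 2))                         # kappa(A_i, A_j) for all index pairs
    sg = s * np.tile(g, 2)                          # s_i * kappa(A_i, b)
    return np.outer(s, s) * G2 - sg[:, None] - sg[None, :] + kernel(b, b)

rng = np.random.default_rng(6)
d, n = 4, 5
A = rng.normal(size=(d, n))               # hypothetical data
b = rng.normal(size=d)
K_poly = kernel_lasso_svm_matrix(A, b, poly_kernel)
K_lin = kernel_lasso_svm_matrix(A, b, linear_kernel)
A_tilde = np.hstack([A, -A]) - b[:, None]  # explicit points for the linear kernel
```

For the linear kernel, `K_lin` coincides with the Gram matrix of the explicit SVM points, while for other kernels only the $n^2 + n + 1$ kernel evaluations are needed and the mapping $\Psi$ is never formed.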

Discussion. While traditional kernel regression corre- 
sponds to a lifting of the rows of the Lasso matrix A 
into the kernel space, our approach (7) by contrast is 
lifting the columns of A (and the r.h.s. b). We note 
that it seems indeed counter-intuitive to make the re- 
gression "more difficult" by artificially increasing the 
dimension of b. Using e.g. a polynomial kernel, this 
means that we also want the higher moments of b to be 
well approximated by our estimated x. On the other 
hand, increasing the dimension of b naturally corre- 
sponds to adding more data rows (or measurements) 
to a classical Lasso instance (2). 

In the light of the success of the kernel idea for the classification
case with its existing well-developed theory,
we think it will be interesting to relate these results
to the above proposed kernelized version of the Lasso,
and to study how different kernels will affect the solution
$x$ for applications of the Lasso. Independently of
our work, (Thiagarajan et al., 2012) has very recently
proposed a similar kernelization idea for the application
of image retrieval.

4.2. Some Implications for SVMs 

The more complicated direction of our reduction (SVM
≼ Lasso) from Section 3.2 also gives a promising approach
towards some new insights into SVMs:

Structure and Sparsity of the Support Vectors, 
in the View of Lasso Sparsity. There has been a 
large amount of new literature studying the sparsity of 
solutions to the Lasso and related $\ell_1$-regularized methods,
in particular the study of the sparsity of $x$ when
A and b are from distributions with certain properties. 
For example, in the setting known as sparse recovery, 
the goal is to approximately recover a sparse solution x 
using instances A, b (consisting of only a small number 
of rows) , where b is interpreted as a noisy or corrupted 
measurement of Ax, see e.g. (Chen et al., 2001; Porat 
& Strauss, 2012). 

Using our construction of the equivalent Lasso instance 
for a given SVM, such results then directly apply to 
the sparsity pattern of the solution to our original 
SVM (i.e. the pattern and the number of support 
vectors). More precisely, any result giving a distri- 
bution of matrices A and corresponding b for which 
the Lasso sparsity is well characterized, will also char- 
acterize the patterns of support vectors for the equiv- 
alent SVM instance (and in particular the number of 
support vectors). This assumes that a Lasso sparsity 
result is applicable for the type of translation b that 
we have used here in order to construct our equivalent 
Lasso instance. However, this is not hopeless, since 
the only special property of our constructed Lasso in- 
stance is that the r.h.s. b is strongly correlated with 
all columns of A. It remains to investigate which dis- 
tributions and corresponding sparsity results would be 
of most interest for the SVM perspective. 

Screening Rules for Support Vector Machines. 

For the Lasso, screening rules have been developed 
recently. This approach consists of a single pre-processing
pass through the data $A$, in order to immediately
discard those predictors $A_i$ that can be guaranteed
to be inactive for the optimal solution (Ghaoui
et al., 2010; Tibshirani et al., 2011). Translated to 
the SVM setting by our reduction, any such Lasso 
screening rule can be used to permanently discard in- 
put points before the SVM optimization is started. 



The screening rule then guarantees that any discarded 
point will not be a support vector, so the resulting op- 
timal classifier remains unchanged. We are not aware 
of screening rules in the SVM literature so far. 

5. Conclusions 

We have investigated the relation between the Lasso 
and SVMs, and constructed equivalent instances of the 
respective other problem. While obtaining an equiva- 
lent SVM instance for a given Lasso is straightforward, 
the other direction is slightly more involved (but efficient
e.g. for $\ell_2$-loss SVMs). The two reductions allow
us to better relate and compare many existing algo- 
rithms for both problems. In the future, by transfer- 
ring more of the existing rich theory between the two 
popular methods, we hope that the understanding of 
both of them can be further deepened. 
A. Proof of the Reduction from SVM
to Lasso

As described in Section 3.2, here we assume that $w \in \mathbb{R}^d$
is some $\sigma$-weakly-separating vector for the original
SVM problem (1) for $\sigma > 0$. Then we study the Lasso
instance $(\tilde A, \tilde b)$ defined above, where the translation
vector is defined as $\tilde b := -\frac{w}{\|w\|_2} \cdot \frac{D^2}{\sigma}$.

Proposition 4 (Flipping improves the objective).
Consider the Lasso problem (2) defined by $(\tilde A, \tilde b)$, and
assume that $x_♦ \in ♦$ has some negative entries.

Then there is a strictly better solution $x_▲ \in ▲$ having
only non-negative entries.

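Proof of Proposition 4. We are given $x_\lozenge \neq 0$, having
at least one negative coordinate. Define $x_\blacktriangle \neq 0$ as
the vector you get by flipping all the negative
coordinates in $x_\lozenge$. We define $\delta \in \blacktriangle$ to be the difference
vector corresponding to this flipping, i.e.
$\delta_i := -(x_\lozenge)_i$ if $(x_\lozenge)_i < 0$, and $\delta_i := 0$ otherwise,
so that $x_\blacktriangle := x_\lozenge + 2\delta$ gives $x_\blacktriangle \in \blacktriangle$.
We want to show that with respect to the quadratic
objective function, $x_\blacktriangle$ is strictly better than $x_\lozenge$.
We do this by showing that the following difference in
the objective values is strictly negative:

$$
\begin{aligned}
\|\tilde{A} x_\blacktriangle - \tilde{b}\|_2^2 - \|\tilde{A} x_\lozenge - \tilde{b}\|_2^2
&= \|c + d\|_2^2 - \|c\|_2^2 \\
&= c^T c + 2 c^T d + d^T d - c^T c = (2c + d)^T d \\
&= 4 (\tilde{A} x_\lozenge - \tilde{b} + \tilde{A}\delta)^T \tilde{A}\delta \\
&= 4 (\tilde{A}(x_\lozenge + \delta) - \tilde{b})^T \tilde{A}\delta ,
\end{aligned}
$$

where in the above calculations we have used that
$\tilde{A} x_\blacktriangle = \tilde{A} x_\lozenge + 2\tilde{A}\delta$, and we substituted
$c := \tilde{A} x_\lozenge - \tilde{b}$ and $d := 2\tilde{A}\delta$. Interestingly,
$x_\lozenge + \delta \in \blacktriangle$, since this addition just sets all
previously negative coordinates to zero.

The proof then follows from Lemma 8 below. □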
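The flipping argument can be checked numerically on a small synthetic instance. The sketch below is only an illustration under stated assumptions: the data are random, $w$ is taken to be the first unit vector, $\sigma$ is the margin that this $w$ attains (assuming Definition 1 measures the margin as $A_i^T w / \|w\|_2$), and $D$ is the largest column norm. The helper names `translated_instance` and `lasso_objective` are hypothetical and not part of the paper.

```python
import numpy as np

def translated_instance(A, w):
    """Build (A_tilde, b_tilde) from SVM data A and a weakly-separating w.

    Assumption: w satisfies A[:, i] @ w / ||w|| >= sigma > 0 for all i.
    """
    u = w / np.linalg.norm(w)
    sigma = np.min(A.T @ u)                 # margin attained by w
    D = np.max(np.linalg.norm(A, axis=0))   # upper bound on the column norms
    b_tilde = -u * D**2 / sigma             # translation vector
    A_tilde = A + b_tilde[:, None]          # add b_tilde to every column
    return A_tilde, b_tilde

def lasso_objective(A_tilde, b_tilde, x):
    r = A_tilde @ x - b_tilde
    return float(r @ r)

rng = np.random.default_rng(0)
d, n = 5, 8
A = rng.normal(size=(d, n))
A[0] = np.abs(A[0]) + 0.5                   # first coordinate positive, so w = e_1 separates
w = np.eye(d)[0]
A_tilde, b_tilde = translated_instance(A, w)

# A point in the l1-ball with at least one negative entry.
x_diamond = rng.normal(size=n)
x_diamond /= 2 * np.abs(x_diamond).sum()    # l1-norm equals 1/2
x_diamond[0] = -abs(x_diamond[0])

# Flip the negative coordinates: x_filled = x_diamond + 2 * delta.
delta = np.where(x_diamond < 0, -x_diamond, 0.0)
x_filled = x_diamond + 2 * delta

diff = (lasso_objective(A_tilde, b_tilde, x_filled)
        - lasso_objective(A_tilde, b_tilde, x_diamond))
closed_form = 4 * (A_tilde @ (x_diamond + delta) - b_tilde) @ (A_tilde @ delta)
```

Here `diff` is the difference of objective values from the proof and `closed_form` is its last line; the two should agree up to rounding, and both should be negative on such an instance.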



Proposition 5 (Scaling up improves for non-negative
vectors). Consider the Lasso problem (2) defined
by $(\tilde{A}, \tilde{b})$, and assume that $x_\blacktriangle \in \blacktriangle$ has
$\|x_\blacktriangle\|_1 < 1$.
Then we obtain a strictly better solution $x_\triangle \in \triangle$ by
linearly scaling $x_\blacktriangle$.

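Proof of Proposition 5. The proof follows along similar
lines as the above proposition. We are given $x_\blacktriangle \neq 0$
with $\|x_\blacktriangle\|_1 < 1$. Define $x_\triangle$ as the vector we get by
scaling up $x_\triangle := \lambda x_\blacktriangle$ by $\lambda > 1$ such that
$\|x_\triangle\|_1 = 1$. We want to show that with respect to the
quadratic objective function, $x_\triangle$ is strictly better than
$x_\blacktriangle$. As in the previous proof, we again do this by
showing that the following difference in the objective
values is strictly negative:

$$
\begin{aligned}
\|\tilde{A} x_\triangle - \tilde{b}\|_2^2 - \|\tilde{A} x_\blacktriangle - \tilde{b}\|_2^2
&= \|c + d\|_2^2 - \|c\|_2^2 \\
&= c^T c + 2 c^T d + d^T d - c^T c = (2c + d)^T d \\
&= \lambda' (2\tilde{A} x_\blacktriangle - 2\tilde{b} + \lambda' \tilde{A} x_\blacktriangle)^T \tilde{A} x_\blacktriangle \\
&= 2\lambda' \big(\tilde{A}(1 + \tfrac{\lambda'}{2}) x_\blacktriangle - \tilde{b}\big)^T \tilde{A} x_\blacktriangle ,
\end{aligned}
$$

where in the above calculations we have used that
$\tilde{A} x_\triangle = \lambda \tilde{A} x_\blacktriangle$ for $\lambda > 1$, and we substituted
$c := \tilde{A} x_\blacktriangle - \tilde{b}$ and
$d := \tilde{A} x_\triangle - \tilde{A} x_\blacktriangle = (\lambda - 1)\tilde{A} x_\blacktriangle =: \lambda' \tilde{A} x_\blacktriangle$
for $\lambda' := \lambda - 1 > 0$. Note that
$x_\triangle := (1 + \lambda') x_\blacktriangle \in \triangle$,
so $(1 + \tfrac{\lambda'}{2}) x_\blacktriangle \in \blacktriangle$.

The proof then follows from Lemma 8 below. □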
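For readers who want the final step spelled out, the following is one way to read the appeal to Lemma 8. It writes $\tilde{A}, \tilde{b}$ for the translated instance and uses only the quantities $\lambda'$ and $x_\blacktriangle$ from the proof; the choice of the pair $(x, \delta)$ is our reading of the argument, not a statement taken from the paper.

```latex
% Apply Lemma 8 with
%   x      := (1 + \lambda'/2) x_\blacktriangle \in \blacktriangle,
%   \delta := x_\blacktriangle \in \blacktriangle, \quad \delta \neq 0.
\[
  (\tilde{A}x - \tilde{b})^T(-\tilde{A}\delta) > 0
  \quad\Longrightarrow\quad
  2\lambda' \Big(\tilde{A}\big(1 + \tfrac{\lambda'}{2}\big)x_\blacktriangle
      - \tilde{b}\Big)^T \tilde{A}x_\blacktriangle < 0 ,
\]
% because \lambda' > 0. The left-hand side of the implication is Lemma 8,
% the right-hand side is the last line of the difference computed in the proof.
```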



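Definition 6. For a given axis vector $w \in \mathbb{R}^d$, the
cone with axis $w$, angle $\alpha \in (0, \frac{\pi}{2})$ with tip at the origin
is defined as $\operatorname{cone}(w, \alpha) := \{x \in \mathbb{R}^d \mid \angle(x, w) \le \alpha\}$,
or equivalently $\frac{x^T w}{\|x\|_2 \|w\|_2} \ge \cos\alpha$. By $\operatorname{cone}^{\circ}(w, \alpha)$ we
denote the interior of the convex set $\operatorname{cone}(w, \alpha)$,
including the tip 0.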

Lemma 7 (Separation). Let $w$ be some $\sigma$-weakly-separating
vector for the SVM (1) for $\sigma > 0$. Then

i) $A\blacktriangle \subseteq \operatorname{cone}(w, \arccos(\frac{\sigma}{D}))$

ii) Any vector in $\operatorname{cone}(w, \arcsin(\frac{\sigma}{D}))$ is still
$\sigma'$-weakly-separating for $A$ for some $\sigma' > 0$.

Proof. i) Definition 1 of weakly separating, and using
that $\|A_i\|_2 \le D$.

ii) For any unit length vector $v \in \operatorname{cone}(w, \arcsin(\frac{\sigma}{D}))$,
every other vector having a zero or negative inner
product with this $v$ must have angle at least
$\frac{\pi}{2} - \arcsin(\frac{\sigma}{D}) = \arccos(\frac{\sigma}{D})$ with the cone axis $w$.
However, by using i), we have $A\triangle \subseteq \operatorname{cone}^{\circ}(w, \arccos(\frac{\sigma}{D}))$,
so every column vector of $A$ must have strictly positive
inner product with $v$, or in other words $v$ is
$\sigma'$-weakly-separating for $A$ (for some $\sigma' > 0$). □
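Both parts of the separation lemma can be illustrated on random data. The sketch below is an assumption-laden check rather than a proof: it takes $w$ as the first unit vector, measures $\sigma$ as the smallest value of $A_i^T w$ (assuming the margin in Definition 1 is normalized by $\|w\|_2$, which equals 1 here), sets $D$ to the largest column norm, and samples directions strictly inside the narrow cone (angle at most 0.99 of the bound). The helper `angle` is hypothetical.

```python
import numpy as np

def angle(x, w):
    """Angle between x and w in radians."""
    c = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))
    return np.arccos(np.clip(c, -1.0, 1.0))

rng = np.random.default_rng(1)
d, n = 4, 10
A = rng.normal(size=(d, n))
A[0] = np.abs(A[0]) + 0.5              # makes w = e_1 weakly separating
w = np.eye(d)[0]                       # unit-length axis vector
sigma = np.min(A.T @ w)                # margin attained by w
D = np.max(np.linalg.norm(A, axis=0))  # bound on the column norms

# i) every column of A lies in cone(w, arccos(sigma / D))
col_angles = np.array([angle(A[:, i], w) for i in range(n)])
in_wide_cone = bool(np.all(col_angles <= np.arccos(sigma / D) + 1e-9))

# ii) unit vectors v inside cone(w, arcsin(sigma / D)) still separate A
beta = np.arcsin(sigma / D)
min_inner = np.inf
for _ in range(200):
    t = rng.normal(size=d)
    t[0] = 0.0                         # direction orthogonal to w
    t /= np.linalg.norm(t)
    theta = rng.uniform(0.0, 0.99 * beta)
    v = np.cos(theta) * w + np.sin(theta) * t   # unit vector at angle theta from w
    min_inner = min(min_inner, float(np.min(A.T @ v)))
```

After the loop, `min_inner` is the smallest inner product observed between any sampled `v` and any column of `A`; a positive value means every sampled direction keeps a positive margin.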

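Lemma 8. Let $w$ be some $\sigma$-weakly-separating vector
for the SVM for $\sigma > 0$. Then we claim that the
translation by the vector $\tilde{b} := -\frac{w}{\|w\|_2} \cdot \frac{D^2}{\sigma}$ has the following
properties: For any pair of vectors $x, \delta \in \blacktriangle$, $\delta \neq 0$, we
have that $(\tilde{A}x - \tilde{b})^T(-\tilde{A}\delta) > 0$.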

Proof. By definition of the translation $\tilde{b}$, we have that
the entire Euclidean ball of radius $D$ around the point
$-\tilde{b}$ (and therefore also the point set $-\tilde{A}\blacktriangle$, and in
particular $v := -\tilde{A}\delta$) is contained in $\operatorname{cone}(w, \arcsin(\frac{\sigma}{D}))$.
Therefore by Lemma 7 ii), $v$ is separating for $A$, and by
translation $v$ also separates $\tilde{A}$ from $\tilde{b}$. This establishes
the result $(\tilde{A}x - \tilde{b})^T v > 0$ for any $x \in \triangle$.

To extend this to the case $x \in \blacktriangle$, we observe that
by definition of $\tilde{b}$, also the point $0 - \tilde{b}$ has strictly
positive inner product with $v$. Therefore the entire
convex hull of $\tilde{A}\triangle \cup 0$ and thus the set $\tilde{A}\blacktriangle$ has the
desired property. □
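The inequality of Lemma 8 can also be probed by random sampling. This is a minimal sketch, assuming a synthetic separable instance with $w$ equal to the first unit vector, $\sigma$ the margin it attains, and $D$ the largest column norm; the sampler `sample_filled_simplex` is a hypothetical helper that draws non-negative points with $\ell_1$-norm in $(0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 8
A = rng.normal(size=(d, n))
A[0] = np.abs(A[0]) + 0.5               # makes w = e_1 weakly separating
w = np.eye(d)[0]
sigma = np.min(A.T @ w)                 # margin attained by the unit vector w
D = np.max(np.linalg.norm(A, axis=0))   # bound on the column norms
b_tilde = -w * D**2 / sigma             # translation vector (||w|| = 1)
A_tilde = A + b_tilde[:, None]          # translated data matrix

def sample_filled_simplex(rng, n):
    """Random non-negative point with l1-norm in (0, 1]."""
    x = rng.dirichlet(np.ones(n))       # point on the simplex
    return x * rng.uniform(0.05, 1.0)   # shrink it towards the origin

# Track the smallest value of (A_tilde x - b_tilde)^T (-A_tilde delta).
worst = np.inf
for _ in range(1000):
    x = sample_filled_simplex(rng, n)
    delta = sample_filled_simplex(rng, n)
    val = (A_tilde @ x - b_tilde) @ (-(A_tilde @ delta))
    worst = min(worst, float(val))
```

A positive `worst` is consistent with the lemma on the sampled pairs; it does not replace the geometric argument, which covers all of $\blacktriangle$.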

B. Some Soft-Margin SVM Variants 
that are Equivalent to (1) 

We include the derivation of the dual formulation to
the $\ell_2$-loss soft-margin SVM (5) for $n$ datapoints
$X_i \in \mathbb{R}^d$, together with their binary class labels
$y_i \in \{\pm 1\}$, for $i \in \{1, \dots, n\}$, as defined above in Section 2.1.

The equivalence to (1) directly extends to the one-
and two-class case, without or with (regularized) offset
term, as well as to the hard-margin SVM. These
equivalent formulations have been known in the SVM
literature, see e.g. (Schölkopf & Smola, 2002; Keerthi
et al., 2000; Tsang et al., 2005; Gärtner & Jaggi, 2009),
and the references therein.

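Lemma 2. The dual of the soft-margin SVM (5) is
an instance of the classifier formulation (1), that is
$\min_{x \in \triangle} \|Ax\|_2^2$, with

$$A := \begin{pmatrix} Z \\ \frac{1}{\sqrt{C}} I_n \end{pmatrix} \in \mathbb{R}^{(d+n) \times n} ,$$

where the data matrix $Z \in \mathbb{R}^{d \times n}$ consists of the $n$
columns $Z_i := y_i X_i$.

Proof. The Lagrangian (Boyd & Vandenberghe, 2004,
Section 5) of the soft-margin SVM formulation (5)
with its $n$ constraints can be written as

$$L(w, \rho, \xi, \alpha) := \tfrac{1}{2}\|w\|_2^2 - \rho + \tfrac{C}{2}\sum_i \xi_i^2
  + \sum_i \alpha_i \left(-w^T Z_i + \rho - \xi_i\right) .$$

Here we introduced a non-negative Lagrange multiplier
$\alpha_i \ge 0$ for each of the $n$ constraints. Differentiating
$L$ with respect to the primal variables, we obtain
the KKT optimality conditions

$$
\begin{aligned}
0 &= \frac{\partial L}{\partial w} = w - \sum_i \alpha_i Z_i \\
0 &= \frac{\partial L}{\partial \rho} = -1 + \sum_i \alpha_i \\
0 &= \frac{\partial L}{\partial \xi} = C\xi - \alpha .
\end{aligned}
$$

When plugged into the Lagrange dual problem
$\max_\alpha \min_{w, \rho, \xi} L(w, \rho, \xi, \alpha)$, these give us the
equivalent formulation

$$
\begin{aligned}
\max_\alpha \;\; & \tfrac{1}{2}\alpha^T Z^T Z\alpha - \rho + \tfrac{C}{2C^2}\alpha^T\alpha
   - \alpha^T Z^T Z\alpha + \rho - \tfrac{1}{C}\alpha^T\alpha \\
 &= -\tfrac{1}{2}\alpha^T Z^T Z\alpha - \tfrac{1}{2C}\alpha^T\alpha .
\end{aligned}
$$

In other words the dual is

$$
\begin{aligned}
\min_\alpha \;\; & \alpha^T \left(Z^T Z + \tfrac{1}{C} I_n\right) \alpha \\
\text{s.t.} \;\; & \alpha \ge 0 \\
 & \alpha^T \mathbf{1} = 1 .
\end{aligned}
$$

This is directly an instance of our first SVM formulation
(1) used in the introduction, if we use the extended
matrix

$$A := \begin{pmatrix} Z \\ \frac{1}{\sqrt{C}} I_n \end{pmatrix} \in \mathbb{R}^{(d+n) \times n} .$$

□

Note that any optimal primal solution $w$ can directly
be obtained from the dual optimum $\alpha$ by using the
optimality condition $w = A\alpha$.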
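The step from the dual objective to formulation (1) rests on the identity $A^T A = Z^T Z + \frac{1}{C} I_n$ for the extended matrix. The sketch below checks this numerically; the data, labels and the value of `C` are arbitrary illustrative choices, and `alpha` is just a random feasible point of the dual (non-negative, summing to one), not an optimizer.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, C = 3, 6, 2.0
X = rng.normal(size=(d, n))                 # datapoints as columns
y = rng.choice([-1.0, 1.0], size=n)         # binary class labels
Z = X * y                                   # columns Z_i = y_i * X_i

# Extended (d + n) x n matrix: Z stacked on top of I_n / sqrt(C).
A = np.vstack([Z, np.eye(n) / np.sqrt(C)])

alpha = rng.dirichlet(np.ones(n))           # feasible: alpha >= 0, sum(alpha) = 1
dual_value = alpha @ (Z.T @ Z + np.eye(n) / C) @ alpha
svm_value = np.linalg.norm(A @ alpha) ** 2  # objective of formulation (1)
```

Since the two objectives coincide on the whole simplex, any solver for formulation (1) can be run on `A` to obtain the soft-margin dual solution.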



